Abstract

Two theorems and a lemma are presented about the use of the jackknife
estimator and the cross-validation method for model selection. Theorem 1
gives the asymptotic form for the jackknife estimator. Combined with the
model selection criterion, this asymptotic form can be used to obtain the
fit of a model. The model selection criterion we used is the negative of the
average predictive likelihood, the choice of which is based on the idea of the
cross-validation method. Lemma 1 provides a formula for further exploration
of the asymptotics of the model selection criterion. Theorem 2 gives
an asymptotic form of the model selection criterion for the regression case,
when the parameters optimization criterion has a penalty term. Theorem
2 also proves the asymptotic equivalence of Moody’s model selection criterion
(Moody, 1992) and the cross-validation method, when the distance
measure between response y and regression function takes the form of a
squared difference.
1  INTRODUCTION

Selecting a model for a specified problem is the key to generalization based on the
training data set. In the context of neural networks, this corresponds to selecting
an architecture. There has been a substantial amount of work in model selection
(Lindley, 1968; Mallows, 1973; Akaike, 1973; Stone, 1977; Atkinson, 1978; Schwarz,
1978; Zellner, 1984; MacKay, 1991; Moody, 1992, etc.). In Moody’s paper (Moody,
1992), the author generalized the Akaike Information Criterion (AIC) (Akaike, 1973)
in the regression case and introduced the term effective number of parameters. It
is thus of great interest to see what the link between this criterion and the
cross-validation method (Stone, 1974) is and what we can gain from it, given the fact
that AIC is asymptotically equivalent to the cross-validation method (Stone, 1977).

In the method of cross-validation (Stone, 1974), a data set, which has a data point
deleted from the original training data set, is used to estimate the parameters of a
model by optimizing a parameters optimization criterion. The optimal parameters
thus obtained are called the jackknife estimator (Miller, 1974). Then the predictive
likelihood of the deleted data point is calculated, based on the estimated parameters.
This is repeated for each data point in the original training data set. The fit
of the model, or the model selection criterion, is chosen as the negative of the
average of these predictive likelihoods. However, re-estimating the parameters for
every deleted data point is computationally expensive. In section 2, we obtain
an asymptotic formula (theorem 1) for the jackknife estimator based on optimizing
a parameters optimization criterion with one data point deleted from the training
data set. This somewhat relieves the computational cost mentioned above. This
asymptotic formula can be used to obtain the model selection criterion by plugging
it into the criterion. Furthermore, in section 3, we obtain the asymptotic form
of the model selection criterion for the general case (lemma 1) and for the special
case when the parameters optimization criterion has a penalty term (theorem 2).
We also prove the equivalence of Moody’s model selection criterion (Moody, 1992)
and the cross-validation method (theorem 2). Only sketchy proofs are given when
these theorems and lemma are introduced. The details of the proofs are given in
section 4.

2  APPROXIMATE  JACKKNIFE  ESTIMATOR 

Let the parameters optimization criterion, with data set
$\omega = \{(x_i, y_i),\ i = 1, \ldots, n\}$ and parameters $\theta$, be $C_\omega(\theta)$,
and let $\omega_{-i}$ denote the data set with the $i$th data point deleted from $\omega$.
If we denote $\hat\theta$ and $\hat\theta_{-i}$ as the optimal parameters for criterion
$C_\omega(\theta)$ and $C_{\omega_{-i}}(\theta)$, respectively, $\nabla_\theta$ as the derivative
with respect to $\theta$ and superscript $t$ as transpose, we have the following theorem
about the relationship between $\hat\theta$ and $\hat\theta_{-i}$.

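Theorem 1 If the criterion function $C_\omega(\theta)$ is an infinite-order differentiable
function and its derivatives are bounded around $\hat\theta$, then the estimator
$\hat\theta_{-i}$ (also called the jackknife estimator (Miller, 1974)) can be approximated as

$$\hat\theta_{-i} - \hat\theta \approx \left(\nabla_\theta\nabla_\theta^t C_\omega(\hat\theta) - \nabla_\theta\nabla_\theta^t C_i(\hat\theta)\right)^{-1} \nabla_\theta C_i(\hat\theta) \tag{1}$$

in which $C_i(\theta) = C_\omega(\theta) - C_{\omega_{-i}}(\theta)$.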

Proof. Use the Taylor expansion of the equation $\nabla_\theta C_{\omega_{-i}}(\hat\theta_{-i}) = 0$
around $\hat\theta$. Ignore terms of second and higher order in $\hat\theta_{-i} - \hat\theta$.
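
As a quick check of this expansion, the sketch below applies the one-step formula
to a criterion for which it should be exact: a one-parameter ridge regression with
$C_\omega(\theta) = \sum_i (y_i - \theta x_i)^2 + \lambda\theta^2$, which is quadratic in
$\theta$, so there are no higher-order terms to ignore. It uses the form
$\hat\theta_{-i} - \hat\theta \approx (\nabla_\theta\nabla_\theta^t C_\omega(\hat\theta) - \nabla_\theta\nabla_\theta^t C_i(\hat\theta))^{-1}\nabla_\theta C_i(\hat\theta)$
that follows from the Taylor expansion with $C_i = C_\omega - C_{\omega_{-i}}$. The model,
the data values and all function names are illustrative assumptions, not taken
from the paper.

```python
# Sketch: one-step jackknife estimator for a one-parameter ridge regression.
# Assumed criterion (illustration only): C(theta) = sum (y - theta*x)^2 + lam*theta^2.

def fit(xs, ys, lam):
    """Minimizer of the ridge criterion for the model y ~ theta * x."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def jackknife_one_step(xs, ys, lam, i):
    """theta_hat_{-i} from the Taylor-expansion formula, without refitting."""
    theta = fit(xs, ys, lam)
    hess_all = 2.0 * (sum(x * x for x in xs) + lam)   # second derivative of C_omega
    hess_i = 2.0 * xs[i] ** 2                         # second derivative of C_i
    grad_i = -2.0 * (ys[i] - theta * xs[i]) * xs[i]   # first derivative of C_i at theta_hat
    return theta + grad_i / (hess_all - hess_i)

def jackknife_refit(xs, ys, lam, i):
    """theta_hat_{-i} obtained the expensive way: delete point i and refit."""
    return fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)

xs = [-2.0, -1.0, 0.5, 1.0, 2.0, 3.0]
ys = [-3.9, -2.2, 1.1, 1.8, 4.3, 5.9]
lam = 0.5
for i in range(len(xs)):
    print(i, jackknife_one_step(xs, ys, lam, i), jackknife_refit(xs, ys, lam, i))
```

For a non-quadratic criterion the two columns would differ by the neglected
higher-order terms; here they agree up to floating-point rounding.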


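Example 1: Using the generalized maximum likelihood method from Bayesian
analysis* (Berger, 1985), if $\pi(\theta)$ is the prior on the parameters and the observations
are mutually independent, for which the distribution is modeled as $y|x \sim f(y|x,\theta)$,
the parameters optimization criterion is

$$C_\omega(\theta) = \log\Big[\prod_{(x_i,y_i)\in\omega} f(y_i|x_i,\theta)\,\pi(\theta)\Big] = \sum_{(x_i,y_i)\in\omega} \log f(y_i|x_i,\theta) + \log\pi(\theta). \tag{2}$$

Thus $C_i(\theta) = \log f(y_i|x_i,\theta)$. If we ignore the influence of the deleted data point in
the inverted matrix (the denominator) of equation 1, we have

$$\hat\theta_{-i} - \hat\theta \approx \left(\nabla_\theta\nabla_\theta^t C_\omega(\hat\theta)\right)^{-1} \nabla_\theta \log f(y_i|x_i,\hat\theta). \tag{3}$$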

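Example 2: In the special case of example 1, with noninformative prior $\pi(\theta) = 1$,
the criterion is the ordinary log-likelihood function, thus

$$\hat\theta_{-i} - \hat\theta \approx \Big[\sum_{(x_j,y_j)\in\omega} \nabla_\theta\nabla_\theta^t \log f(y_j|x_j,\hat\theta)\Big]^{-1} \nabla_\theta \log f(y_i|x_i,\hat\theta). \tag{4}$$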
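
To see how the approximation of example 2 behaves on a likelihood that is not
quadratic, the sketch below uses a toy exponential model
$f(y|\theta) = \theta e^{-\theta y}$ with no covariates $x$. This model is an assumption
chosen only because both $\hat\theta$ and the refitted $\hat\theta_{-i}$ have closed
forms; the data values and function names are likewise hypothetical. The shift is
computed as the inverse of the summed second derivatives of the log-likelihood
times the first derivative of the deleted term.

```python
# Sketch: approximate jackknife shift for a toy model f(y|theta) = theta*exp(-theta*y).
# log f = log(theta) - theta*y, so the maximum likelihood estimate is n / sum(y).

def mle(ys):
    return len(ys) / sum(ys)

def shift_approx(ys, i):
    """Approximate theta_hat_{-i} - theta_hat using the full-data Hessian."""
    theta = mle(ys)
    hess = -len(ys) / theta ** 2     # sum_j d^2/dtheta^2 log f(y_j|theta) at theta_hat
    grad_i = 1.0 / theta - ys[i]     # d/dtheta log f(y_i|theta) at theta_hat
    return grad_i / hess

def shift_exact(ys, i):
    """Exact shift obtained by refitting without point i."""
    return mle(ys[:i] + ys[i + 1:]) - mle(ys)

ys = [0.3, 1.2, 0.7, 2.5, 0.1, 0.9, 1.6, 0.4, 3.1, 0.8]
for i in range(len(ys)):
    print(i, shift_approx(ys, i), shift_exact(ys, i))
```

For this particular model the approximate shift works out to the exact one
multiplied by $1 - y_i/\sum_j y_j$, so the two have the same sign and agree more
closely as the data set grows.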

3  CROSS-VALIDATION  METHOD  AND  MODEL 
SELECTION  CRITERION 

Hereafter we use the negative of the average predictive likelihood, or,

$$T_m(\omega) = -\frac{1}{n} \sum_{(x_i,y_i)\in\omega} \log f(y_i|x_i,\hat\theta_{-i}) \tag{5}$$

as the model selection criterion, in which $n$ is the size of the training data set $\omega$,
$m \in \mathcal{M}$ denotes parametric probability models $f(y|x,\theta)$ and $\mathcal{M}$ is the set of all the
models in consideration. It is well known that $T_m(\omega)$ is an unbiased estimator of
$r(\theta_0, \hat\theta(\cdot))$, the risk of using the model $m$ and estimator $\hat\theta$, when the true
parameters are $\theta_0$ and the training data set is $\omega$ (Stone, 1974; Efron and Gong, 1983;
etc.), i.e.,

$$\begin{aligned}
r(\theta_0, \hat\theta(\cdot)) &= E\{T_m(\omega)\} \\
&= E\{-\log f(y|x,\hat\theta(\omega))\} \\
&= E\Big\{-\frac{1}{k} \sum_{(x_j,y_j)\in\omega_n} \log f(y_j|x_j,\hat\theta(\omega))\Big\}
\end{aligned} \tag{6}$$

in which $\omega_n = \{(x_j, y_j),\ j = 1, \ldots, k\}$ is the test data set, $\hat\theta(\cdot)$ is an implicit
function of the training data set $\omega$ and it is the estimator we decide to use after
we have observed the training data set $\omega$. The expectation above is taken over the
randomness of $\omega$, $x$, $y$ and $\omega_n$. The optimal model will be the one that minimizes
this criterion. This procedure of using $\hat\theta_{-i}$ and $T_m(\omega)$ to obtain an estimate of risk
is often called the cross-validation method (Stone, 1974; Efron and Gong, 1983).

Remark: After we have obtained $\hat\theta$ for a model, we can use equation 1 to calculate
$\hat\theta_{-i}$ for each $i$, and put the resulting $\hat\theta_{-i}$ into equation 5 to get the fit of the model.
We can then compare different models $m \in \mathcal{M}$.
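
The procedure in this remark can be sketched as follows for two hypothetical
one-parameter models, $y \sim N(\theta\,\phi(x), \sigma^2)$ with $\phi(x) = x$ and
$\phi(x) = x^2$, fitted with a ridge penalty $\lambda\theta^2$. The model family, the
fixed $\sigma$, the data and the function names are all assumptions made for
illustration. The jackknife estimators come from the one-step formula
$\hat\theta_{-i} - \hat\theta \approx (\nabla_\theta\nabla_\theta^t C_\omega - \nabla_\theta\nabla_\theta^t C_i)^{-1}\nabla_\theta C_i$,
so each model is fitted only once.

```python
import math

# Sketch: compare two hypothetical models with the cross-validation criterion,
# using one-step jackknife estimators instead of n refits.
# Assumed criterion: C(theta) = sum (y - theta*phi(x))^2 + lam*theta^2.

def fit(feats, ys, lam):
    return sum(f * y for f, y in zip(feats, ys)) / (sum(f * f for f in feats) + lam)

def criterion(xs, ys, phi, lam=0.1, sigma=0.5):
    """Negative average predictive log-likelihood with theta_hat_{-i} plugged in."""
    feats = [phi(x) for x in xs]
    theta = fit(feats, ys, lam)
    hess_all = 2.0 * (sum(f * f for f in feats) + lam)
    total = 0.0
    for f, y in zip(feats, ys):
        grad_i = -2.0 * (y - theta * f) * f                   # gradient of the deleted term
        theta_i = theta + grad_i / (hess_all - 2.0 * f * f)   # one-step theta_hat_{-i}
        total += (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                  - (y - theta_i * f) ** 2 / (2.0 * sigma ** 2))
    return -total / len(xs)

xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.05, -0.15]
ys = [2.0 * x + e for x, e in zip(xs, noise)]   # data generated by a linear rule

t_linear = criterion(xs, ys, lambda x: x)          # model m1: y = theta * x
t_quadratic = criterion(xs, ys, lambda x: x * x)   # model m2: y = theta * x^2
print(t_linear, t_quadratic)
```

The model with the smaller criterion value is preferred; on this data the linear
model should win by a wide margin.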

* Strictly speaking, it is a method to find the posterior mode.


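Lemma 1 If the probability model $f(y|x,\theta)$, as a function of $\theta$, is differentiable up
to infinite order and its derivatives are bounded around $\hat\theta$, then the approximation to
the model selection criterion, equation 5, can be written as

$$T_m(\omega) \approx -\frac{1}{n} \sum_{(x_i,y_i)\in\omega} \log f(y_i|x_i,\hat\theta) - \frac{1}{n} \sum_{(x_i,y_i)\in\omega} \nabla_\theta^t \log f(y_i|x_i,\hat\theta)\,(\hat\theta_{-i} - \hat\theta) \tag{7}$$

Proof. Ignoring the terms of second and higher order in $\hat\theta_{-i} - \hat\theta$ of the Taylor
expansion of $\log f(y_i|x_i,\hat\theta_{-i})$ around $\hat\theta$ will yield the result.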
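
The size of the neglected terms can be inspected on a small example. The sketch
below evaluates the exact criterion, its first-order expansion around $\hat\theta$,
and the plain training negative log-likelihood for a toy exponential model
$f(y|\theta) = \theta e^{-\theta y}$. The model, the data and the function names are
assumptions chosen because the jackknife estimators have a closed form.

```python
import math

# Sketch: exact cross-validation criterion versus its first-order expansion
# for a toy model f(y|theta) = theta * exp(-theta * y).

def log_f(y, theta):
    return math.log(theta) - theta * y

def mle(ys):
    return len(ys) / sum(ys)

def criteria(ys):
    n = len(ys)
    theta = mle(ys)
    exact = first_order = plain = 0.0
    for i, y in enumerate(ys):
        theta_i = mle(ys[:i] + ys[i + 1:])   # exact jackknife estimator (refit)
        grad = 1.0 / theta - y               # d/dtheta log f(y_i|theta) at theta_hat
        exact += log_f(y, theta_i)
        first_order += log_f(y, theta) + grad * (theta_i - theta)
        plain += log_f(y, theta)
    return -exact / n, -first_order / n, -plain / n

ys = [0.3, 1.2, 0.7, 2.5, 0.1, 0.9, 1.6, 0.4, 3.1, 0.8]
t_exact, t_first_order, nll_train = criteria(ys)
print(t_exact, t_first_order, nll_train)
```

Because $\log f$ is concave in $\theta$ for this model, the first-order value lies
between the training negative log-likelihood and the exact criterion.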

Example 2 (continued): Using equation 4, we have, for the model selection criterion,

$$T_m(\omega) = -\frac{1}{n} \sum_{(x_i,y_i)\in\omega} \log f(y_i|x_i,\hat\theta) - \frac{1}{n} \sum_{(x_i,y_i)\in\omega} \nabla_\theta^t \log f(y_i|x_i,\hat\theta)\, A^{-1}\, \nabla_\theta \log f(y_i|x_i,\hat\theta) \tag{8}$$

in which $A = \sum_{(x_j,y_j)\in\omega} \nabla_\theta\nabla_\theta^t \log f(y_j|x_j,\hat\theta)$. If the model $f(y|x,\theta)$ is the true
one, the second term is asymptotically equal to $p/n$, where $p$ is the number of
parameters in the model. So, multiplied by $n$, the model selection criterion is

−log-likelihood + number of parameters of the model.

This is the well known Akaike’s Information Criterion (AIC) (Akaike, 1973).

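
A small simulation can illustrate why the summation in the second term approaches
the number of parameters. The sketch below draws data from a one-parameter
exponential model $f(y|\theta) = \theta e^{-\theta y}$, so that the model is the true one
and $p = 1$, and evaluates $-\sum_i \nabla_\theta^t \log f_i\, A^{-1}\, \nabla_\theta \log f_i$
at the maximum likelihood estimate. The model, the true rate, the seed and the
sample size are arbitrary assumptions.

```python
import random

# Sketch: the penalty summation for a correctly specified one-parameter model,
# f(y|theta) = theta * exp(-theta * y), for which p = 1.

def aic_penalty(ys):
    """-sum_i grad_i * A^{-1} * grad_i, A = sum_j d^2/dtheta^2 log f(y_j|theta_hat)."""
    n = len(ys)
    theta = n / sum(ys)          # maximum likelihood estimate
    a = -n / theta ** 2          # A: second derivative of the full log-likelihood
    return -sum((1.0 / theta - y) ** 2 for y in ys) / a

rng = random.Random(0)                              # fixed seed, chosen arbitrarily
ys = [rng.expovariate(2.0) for _ in range(20000)]   # data drawn from the model itself
penalty = aic_penalty(ys)
print(penalty)                                      # expected to be close to p = 1
```

For this model the quantity reduces to the squared coefficient of variation of the
sample, which tends to 1 for exponentially distributed data.
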
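Example 1 (continued): Consider the probability model

$$f(y|x,\theta) = \beta \exp\Big(-\frac{1}{2\sigma^2}\,\mathcal{E}(y, \eta_\theta(x))\Big) \tag{9}$$

in which $\beta$ is a normalization factor and $\mathcal{E}(y, \eta_\theta(x))$ is a distance measure between $y$ and
the regression function $\eta_\theta(x)$. $\mathcal{E}(\cdot)$ as a function of $\theta$ is assumed differentiable. Denoting²
$U(\theta,\lambda,\omega) = \sum_{(x_i,y_i)\in\omega} \mathcal{E}(y_i, \eta_\theta(x_i)) - 2\sigma^2 \log\pi(\theta|\lambda)$, we have the following theorem.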

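Theorem 2 For the model specified in equation 9 and the parameters optimization
criterion specified in equation 2 (example 1), under regularity conditions, the unbiased
estimator of

$$E\Big\{\frac{1}{k} \sum_{(x_j,y_j)\in\omega_n} \mathcal{E}(y_j, \eta_{\hat\theta}(x_j))\Big\} \tag{10}$$

is asymptotically equal to

$$\frac{1}{n} \sum_{(x_i,y_i)\in\omega} \mathcal{E}(y_i, \eta_{\hat\theta}(x_i)) + \frac{1}{n} \sum_{(x_i,y_i)\in\omega} \nabla_\theta^t \mathcal{E}(y_i, \eta_{\hat\theta}(x_i)) \{\nabla_\theta\nabla_\theta^t U(\hat\theta,\lambda,\omega)\}^{-1} \nabla_\theta \mathcal{E}(y_i, \eta_{\hat\theta}(x_i)). \tag{11}$$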

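² For example, with $\pi(\theta|\lambda) = \mathcal{N}(0, \sigma^2/\lambda)$, this corresponds to
$U(\theta,\lambda,\omega) = \sum_{(x_i,y_i)\in\omega} \mathcal{E}(y_i, \eta_\theta(x_i)) + \lambda\theta^t\theta + \mathrm{const}(\lambda, \sigma^2)$.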


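For the case when $\mathcal{E}(y, \eta_\theta(x)) = (y - \eta_\theta(x))^2$, we get, as the asymptotic equivalent
of equation 11,

$$\mathcal{E}(\hat\theta,\omega) + \frac{\sigma^2}{n} \sum_{i=1}^{n} \frac{\partial}{\partial y_i}\nabla_\theta^t\, n\mathcal{E}(\hat\theta,\omega)\, \{\nabla_\theta\nabla_\theta^t U(\hat\theta,\lambda,\omega)\}^{-1}\, \frac{\partial}{\partial y_i}\nabla_\theta\, n\mathcal{E}(\hat\theta,\omega) \tag{12}$$

in which $\omega = \{(x_i, y_i),\ i = 1, \ldots, n\}$ is the training data set,
$\omega_n = \{(x_j, y_j),\ j = 1, \ldots, k\}$ is the test data set, and
$\mathcal{E}(\theta,\omega) = \frac{1}{n} \sum_{(x_i,y_i)\in\omega} \mathcal{E}(y_i, \eta_\theta(x_i))$.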

Proof. This result comes directly from theorem 1 and lemma 1. Some asymptotic
techniques have to be used.

Remark:  The  result  in  equation  12  was  first  proposed  by  Moody  (Moody,  1992). 
The  effective  number  of  parameters  formulated  in  his  paper  corresponds  to  the 
summation  in  equation  12.  Since  the  result  in  this  theorem  comes  directly  from 
the  asymptotics  of  the  cross-validation  method  and  the  jackknife  estimator,  it  gives 
the  equivalency  proof  between  Moody’s  model  selection  criterion  and  the  cross- 
validation  method.  The  detailed  proof  of  this  theorem,  presented  in  section  4,  is 
in  spirit  the  same  as  the  one  presented  in  Stone’s  paper  about  the  proof  of  the 
asymptotic  equivalence  of  AIC  and  the  cross-validation  method  (Stone,  1977). 
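
The equivalence stated in this remark can be checked numerically on a toy problem.
The sketch below assumes a one-parameter regression $\eta_\theta(x) = \theta x$ with
squared distance and $U(\theta,\lambda,\omega) = \sum_i (y_i - \theta x_i)^2 + \lambda\theta^2$,
for which the exact leave-one-out error is available in closed form, and compares
it with the training error plus the penalty term of equation 12. The model, the
noise level, the seed, the sample size and the function names are all assumptions.

```python
import random

# Sketch: training error plus the penalty term of equation 12, versus exact
# leave-one-out cross-validation, for a one-parameter ridge regression.

def compare(xs, ys, lam, sigma2):
    n = len(xs)
    s = sum(x * x for x in xs) + lam                 # hess_u / 2
    hess_u = 2.0 * s                                 # second derivative of U
    theta = sum(x * y for x, y in zip(xs, ys)) / s   # minimizer of U
    resid = [y - theta * x for x, y in zip(xs, ys)]
    train = sum(r * r for r in resid) / n            # average squared training error
    # Summation in equation 12: d/dy_i of the gradient of n*E is -2*x_i here.
    summation = sum((-2.0 * x) * (-2.0 * x) / hess_u for x in xs)
    moody = train + sigma2 * summation / n
    # Exact leave-one-out error: for this quadratic criterion the residual of the
    # deleted point equals r_i / (1 - h_i) with h_i = x_i^2 / s.
    loo = sum((r / (1.0 - x * x / s)) ** 2 for x, r in zip(xs, resid)) / n
    return train, moody, loo

rng = random.Random(1)                               # fixed seed, chosen arbitrarily
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
ys = [1.5 * x + rng.gauss(0.0, 0.5) for x in xs]     # true noise variance 0.25
train, moody, loo = compare(xs, ys, lam=1.0, sigma2=0.25)
print(train, moody, loo)
```

In this setting the penalised training error and the leave-one-out error should
differ by much less than either differs from the raw training error.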

4  DETAILED  PROOF  OF  LEMMAS  AND  THEOREMS 

In  order  to  prove  theorem  1,  lemma  1  and  theorem  2,  we  will  present  three  auxiliary 
lemmas  first. 

Lemma 2 For random variable sequences $x_n$ and $y_n$, if $\lim_{n\to\infty} x_n = \lim_{n\to\infty} y_n = z$,
then $x_n$ and $y_n$ are asymptotically equivalent.

Proof. This comes from the definition of asymptotic equivalence, because asymptotically
the two random variables will behave the same as the random variable $z$.

Lemma 3 Consider the summation $\sum_i h(x_i, y_i)\, g(x_i, z)$. If $E(h(x,y)|x,z)$ is a
constant $c$ independent of $x$, $y$, $z$, then the summation is asymptotically equivalent
to $c \sum_i g(x_i, z)$.

Proof. According to the law of large numbers,

$$\lim_{n\to\infty} \frac{1}{n} \sum_i h(x_i, y_i)\, g(x_i, z) = E(h(x,y)\, g(x,z)) = E(E(h(x,y)|x,z)\, g(x,z)) = c\,E(g(x,z))$$

which is the same as the limit of $\frac{1}{n}\, c \sum_i g(x_i, z)$. Using lemma 2, we get the result of
this lemma.

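Lemma 4 If $\eta_\theta(\cdot)$ and $g(\theta,\cdot)$ are differentiable up to the second order, and the
model $y = \eta_\theta(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is the true model, the second derivative with
respect to $\theta$ of

$$U(\theta,\lambda,\omega) = \sum_{i=1}^{n} (y_i - \eta_\theta(x_i))^2 + g(\theta,\lambda)$$

evaluated at the minimum of $U$, i.e., $\hat\theta$, is asymptotically independent of the random
variables $\{y_i,\ i = 1, \ldots, n\}$.

Proof. Explicit calculation of the second derivative of $U$ with respect to $\theta$, evaluated
at $\hat\theta$, gives

$$\nabla_\theta\nabla_\theta^t U(\hat\theta,\lambda,\omega) = 2 \sum_{i=1}^{n} \nabla_\theta \eta_{\hat\theta}(x_i) \nabla_\theta^t \eta_{\hat\theta}(x_i) - 2 \sum_{i=1}^{n} (y_i - \eta_{\hat\theta}(x_i)) \nabla_\theta\nabla_\theta^t \eta_{\hat\theta}(x_i) + \nabla_\theta\nabla_\theta^t g(\hat\theta,\lambda)$$

As $n$ approaches infinity, the effect of the second term in $U$ vanishes, $\hat\theta$ approaches
the mean squared error estimator with an infinite amount of data points, or the true
parameters $\theta_0$ of the model (consistency of the MSE estimator (Jennrich, 1969)), and
$E(y - \eta_{\hat\theta}(x))$ approaches $E(y - \eta_{\theta_0}(x))$, which is 0. According to lemma 2 and lemma 3, the
second term of this second derivative vanishes asymptotically. So as $n$ approaches
infinity, the second derivative of $U$ with respect to $\theta$, evaluated at $\hat\theta$, approaches

$$\nabla_\theta\nabla_\theta^t U(\theta_0,\lambda,\omega) = 2 \sum_{i=1}^{n} \nabla_\theta \eta_{\theta_0}(x_i) \nabla_\theta^t \eta_{\theta_0}(x_i) + \nabla_\theta\nabla_\theta^t g(\theta_0,\lambda)$$

which is independent of $\{y_i,\ i = 1, \ldots, n\}$. According to lemma 2, the result of this
lemma is readily obtained.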

Now  we  give  the  detailed  proof  of  theorem  1,  lemma  1  and  theorem  2. 

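Proof of Theorem 1. The jackknife estimator $\hat\theta_{-i}$ satisfies $\nabla_\theta C_{\omega_{-i}}(\hat\theta_{-i}) = 0$.
The Taylor expansion of the left side of this equation around $\hat\theta$ gives

$$\nabla_\theta C_{\omega_{-i}}(\hat\theta) + \nabla_\theta\nabla_\theta^t C_{\omega_{-i}}(\hat\theta)(\hat\theta_{-i} - \hat\theta) + O(|\hat\theta_{-i} - \hat\theta|^2) = 0$$

According to the definition of $\hat\theta$ and $\hat\theta_{-i}$, their difference is a small quantity.
Also, because of the boundedness of the derivatives, we can ignore higher order terms
in the Taylor expansion and get the approximation

$$\hat\theta_{-i} - \hat\theta \approx -\left(\nabla_\theta\nabla_\theta^t C_{\omega_{-i}}(\hat\theta)\right)^{-1} \nabla_\theta C_{\omega_{-i}}(\hat\theta)$$

Since $\hat\theta$ satisfies $\nabla_\theta C_\omega(\hat\theta) = 0$, we have $\nabla_\theta C_{\omega_{-i}}(\hat\theta) = -\nabla_\theta C_i(\hat\theta)$ and
$\nabla_\theta\nabla_\theta^t C_{\omega_{-i}}(\hat\theta) = \nabla_\theta\nabla_\theta^t C_\omega(\hat\theta) - \nabla_\theta\nabla_\theta^t C_i(\hat\theta)$, so we can rewrite this
equation and obtain equation 1.

Proof of Lemma 1. The Taylor expansion of $\log f(y_i|x_i,\hat\theta_{-i})$ around $\hat\theta$ is

$$\log f(y_i|x_i,\hat\theta_{-i}) = \log f(y_i|x_i,\hat\theta) + \nabla_\theta^t \log f(y_i|x_i,\hat\theta)(\hat\theta_{-i} - \hat\theta) + O(|\hat\theta_{-i} - \hat\theta|^2)$$

Putting this into equation 5 and ignoring higher order terms for the same argument
as that presented in the proof of theorem 1, we readily get equation 7.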

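Proof of Theorem 2. Up to an additive constant dependent only on $\lambda$ and $\sigma^2$,
the optimization criterion, or equation 2, can be rewritten as

$$C_\omega(\theta) = -\frac{1}{2\sigma^2}\, U(\theta,\lambda,\omega) \tag{13}$$

Now putting equation 9 and 13 into equation 3, we get,

$$\hat\theta_{-i} - \hat\theta \approx \{\nabla_\theta\nabla_\theta^t U(\hat\theta,\lambda,\omega)\}^{-1} \nabla_\theta \mathcal{E}(y_i, \eta_{\hat\theta}(x_i)) \tag{14}$$

Putting equation 14 into equation 7, we get, for the model selection criterion (with
the additive constant $-\log\beta$ dropped),

$$T_m(\omega) = \frac{1}{n} \sum_{(x_i,y_i)\in\omega} \frac{1}{2\sigma^2}\, \mathcal{E}(y_i, \eta_{\hat\theta}(x_i)) + \frac{1}{n} \sum_{(x_i,y_i)\in\omega} \frac{1}{2\sigma^2}\, \nabla_\theta^t \mathcal{E}(y_i, \eta_{\hat\theta}(x_i)) \{\nabla_\theta\nabla_\theta^t U(\hat\theta,\lambda,\omega)\}^{-1} \nabla_\theta \mathcal{E}(y_i, \eta_{\hat\theta}(x_i)) \tag{15}$$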

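Recall the discussion associated with equation 6 and now

$$E\Big\{-\frac{1}{k} \sum_{(x_j,y_j)\in\omega_n} \log f(y_j|x_j,\hat\theta)\Big\} = -\log\beta + E\Big\{\frac{1}{k} \sum_{(x_j,y_j)\in\omega_n} \frac{1}{2\sigma^2}\, \mathcal{E}(y_j, \eta_{\hat\theta}(x_j))\Big\} \tag{16}$$

after some simple algebra, we can obtain the unbiased estimator of equation 10.
The result is equation 15 multiplied by $2\sigma^2$, or equation 11. Thus we prove the first
part of the theorem.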

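Now consider the case when

$$\mathcal{E}(y, \eta_\theta(x)) = (y - \eta_\theta(x))^2 \tag{17}$$

The second term of equation 11 now becomes

$$\frac{1}{n} \sum_{(x_i,y_i)\in\omega} 4\,(y_i - \eta_{\hat\theta}(x_i))^2\, \nabla_\theta^t \eta_{\hat\theta}(x_i) \{\nabla_\theta\nabla_\theta^t U(\hat\theta,\lambda,\omega)\}^{-1} \nabla_\theta \eta_{\hat\theta}(x_i) \tag{18}$$

As $n$ approaches infinity, $\hat\theta$ approaches the true parameters $\theta_0$, $\nabla_\theta \eta_{\hat\theta}(x_i)$ approaches
$\nabla_\theta \eta_{\theta_0}(x_i)$ and $E(y - \eta_{\hat\theta}(x))^2$ is asymptotically equal to $\sigma^2$. Using lemma 4 and
lemma 3, we get, as the asymptotic equivalent of equation 18,

$$\frac{\sigma^2}{n} \sum_{(x_i,y_i)\in\omega} 2\nabla_\theta^t \eta_{\hat\theta}(x_i) \{\nabla_\theta\nabla_\theta^t U(\hat\theta,\lambda,\omega)\}^{-1}\, 2\nabla_\theta \eta_{\hat\theta}(x_i) \tag{19}$$

If we use the notation $\mathcal{E}(\theta,\omega) = \frac{1}{n} \sum_{(x_i,y_i)\in\omega} \mathcal{E}(y_i, \eta_\theta(x_i))$, with $\mathcal{E}(y, \eta_\theta(x))$ of the form
specified in equation 17, we can get,

$$\frac{\partial}{\partial y_i} \nabla_\theta\, n\mathcal{E}(\hat\theta,\omega) = -2\nabla_\theta \eta_{\hat\theta}(x_i) \tag{20}$$

Combining this with equation 19 and equation 11, we can readily obtain equation 12.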

5  SUMMARY 

In  this  paper,  we  used  asymptotics  to  obtain  the  jackknife  estimator,  which  can 
be  used  to  get  the  fit  of  a  model  by  plugging  it  into  the  model  selection  criterion. 
Based  on  the  idea  of  the  cross-validation  method,  we  used  the  negative  of  the 
average predictive likelihood as the model selection criterion. We also obtained
the asymptotic form of the model selection criterion and proved that when the
parameters optimization criterion is the mean squared error plus a penalty term,
this asymptotic form is the same as the form presented by Moody (1992). This
also  served  to  prove  the  asymptotic  equivalence  of  this  criterion  to  the  method  of 
cross-validation. 